knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
First I loaded the wine data and looked at the variable names. The structure of the data is examined after that.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
I plotted a histogram of each variable to get a quick look at the data.
At first glance, most of the features have a single peak. However, residual sugar, sulphates, alcohol and probably pH show multiple peaks, which might indicate a multimodal distribution. I will investigate this further.
Let’s look at alcohol content. I start by looking at the summary of alcohol.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The median (10.40) and the mean (10.51) of alcohol are close to each other. Let’s try different binwidths to see if there is something interesting. In this plot the binwidth is 1 and the data is skewed to the right. Let’s make the binwidth smaller.
With the smaller binwidth, the data looks bimodal, maybe even trimodal.
The highest peak is at about 9.4, which is below the median (10.40) and close to the first quartile (9.50). This is probably a bimodal distribution.
## Mode of Alcohol: 9.4
The mode is 9.4, which matches the highest peak in the previous plot.
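Base R has no built-in function for the statistical mode (mode() returns the storage mode of an object), so a value like the one above needs a small helper. The sketch below is one possible way to do it; the helper name stat_mode and the toy values are assumptions for illustration, not the code used in this report.

```r
# Hypothetical helper: the most frequent value of a numeric vector.
# Ties are resolved in favour of the smallest value, because table() sorts its names.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  # names(counts) holds the distinct values as strings; convert back to numeric
  as.numeric(names(counts)[which.max(counts)])
}

# Toy values standing in for the alcohol column of the wine data
alcohol_toy <- c(9.4, 9.4, 9.4, 10.1, 10.1, 11.0, 12.5)
cat("Mode of Alcohol:", stat_mode(alcohol_toy), "\n")
```

Because the helper counts exact values, it works best on data recorded with limited precision, such as alcohol measured to one decimal place.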
Since the data is still skewed, let’s apply a transformation.
The plot shows a peak around 10. This plot supports the view that alcohol has a bimodal distribution. Let’s look at the alcohol distribution according to quality.
There are very few wines in quality group 3 (worst) and quality group 9 (best). The trend is that the higher the percentage of alcohol, the better the quality of the wine. For quality 4 and 5 the data is skewed to the right, with most wines at low alcohol values, which means there is less alcohol in lower quality wines. Quality 6, which is average, is still a bit skewed to the right. For quality 7, 8 and 9 the data is skewed to the left, meaning the higher the alcohol content, the better the quality of the wine.
Is this why we have a bimodal distribution in the alcohol feature? Lower alcohol content goes with lower quality wine and higher alcohol content with higher quality wine. Let’s do the same analysis for sugar, pH and sulphates.
Let’s look at the sugar summary and mode.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
## Residual sugar mode: 1.2
Next, let’s plot the histogram of residual sugar with a binwidth of 0.1. The plot only shows the data between 0 and 20, since the upper part between 20 and the maximum (65.8) contains very few wines. The data is not normally distributed and has a long right tail. To correct for this, I will try a log10 transformation.
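The effect of a log10 transformation on a long-tailed variable can be sketched as follows. The data here is simulated as a mixture of two groups and only stands in for the residual sugar column; the variable names are assumptions for illustration.

```r
# Simulated stand-in for residual sugar: a low-sugar and a high-sugar group
set.seed(1)
sugar_toy <- c(rlnorm(300, meanlog = log(1.5), sdlog = 0.3),
               rlnorm(200, meanlog = log(8),   sdlog = 0.4))

# Histogram on the log10 scale; plot = FALSE returns the bin counts without drawing
log_hist <- hist(log10(sugar_toy), breaks = 30, plot = FALSE)

# Midpoint of the tallest bin, converted back to the original scale
peak_value <- 10^log_hist$mids[which.max(log_hist$counts)]
cat("Tallest bin is centred near", round(peak_value, 2), "\n")
```

With ggplot2, a similar view is usually obtained by adding scale_x_log10() to a geom_histogram() plot, which keeps the axis labels in the original units.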
Residual sugar has a bimodal distribution. The highest peak is at 1.2 (the mode of the data). This plot shows that there are two groups of wines: one with lower sugar content, with a peak at 1.2, and another with higher sugar content, with a peak at around 7.5.
The plot of sugar for each quality shows that the bimodal character of the data is preserved within each quality group. In each quality group there are wines with more sugar and wines with less sugar. I think the trend is not clearly visible in quality groups 3 and 9 because these two groups are very sparse.
The sulphates variable is slightly skewed to the right. It has a mean of around 0.5.
pH is approximately normally distributed, with a mean of around 3.1.
The distribution for each quality group is similar. Groups 3 and 9 have very few observations, which is why their distributions look different.
The data set contains 4898 observations of 13 variables. The variable X is just an index of the tested wine. There are 12 other variables. Quality is a categorical variable that scores the wine from 0 to 10.
Probably the main feature in the wine data set is its quality, since the other features can be used to train a model that estimates the quality of the wine. This can be seen as a classification problem (one class per quality score; seven scores, 3 to 9, appear in this data). It can also be treated as a regression problem in which we try to predict the quality of the wine. It is interesting to see whether there is a correlation between any of the features and the quality. However, correlation does not imply causation.
Other features that were interesting to me were sugar, alcohol, pH and sulphates. I saw irregularities in these features, so I plotted them in more detail. Alcohol content was of particular interest: higher quality wines tend to have more alcohol. The wines also seem to be divided into two groups, one with lower sugar content and another with higher sugar content.
For this part I did not create any new variables. However, the three acidity variables could have been combined into a single acidity variable. I did not do this because fixed acidity (tartaric acid) has much higher values than volatile acidity (acetic acid) and citric acid.
The wine data set was quite tidy. Alcohol had a bimodal distribution. For plotting, a couple of log10 transformations were applied in order to see the patterns better.
In the bivariate section I would like to look at the correlations between features, since this gives a guideline for which features to plot and which features are actually related to the quality of the wine.
## missing values : 0
Since there are no missing values, the function cor() was used to create the correlation matrix.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.02 0.29
## volatile.acidity -0.02 1.00 -0.15
## citric.acid 0.29 -0.15 1.00
## residual.sugar 0.09 0.06 0.09
## chlorides 0.02 0.07 0.11
## free.sulfur.dioxide -0.05 -0.10 0.09
## total.sulfur.dioxide 0.09 0.09 0.12
## density 0.27 0.03 0.15
## pH -0.43 -0.03 -0.16
## sulphates -0.02 -0.04 0.06
## alcohol -0.12 0.07 -0.08
## quality -0.11 -0.19 -0.01
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.09 0.02 -0.05
## volatile.acidity 0.06 0.07 -0.10
## citric.acid 0.09 0.11 0.09
## residual.sugar 1.00 0.09 0.30
## chlorides 0.09 1.00 0.10
## free.sulfur.dioxide 0.30 0.10 1.00
## total.sulfur.dioxide 0.40 0.20 0.62
## density 0.84 0.26 0.29
## pH -0.19 -0.09 0.00
## sulphates -0.03 0.02 0.06
## alcohol -0.45 -0.36 -0.25
## quality -0.10 -0.21 0.01
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity 0.09 0.27 -0.43 -0.02 -0.12
## volatile.acidity 0.09 0.03 -0.03 -0.04 0.07
## citric.acid 0.12 0.15 -0.16 0.06 -0.08
## residual.sugar 0.40 0.84 -0.19 -0.03 -0.45
## chlorides 0.20 0.26 -0.09 0.02 -0.36
## free.sulfur.dioxide 0.62 0.29 0.00 0.06 -0.25
## total.sulfur.dioxide 1.00 0.53 0.00 0.13 -0.45
## density 0.53 1.00 -0.09 0.07 -0.78
## pH 0.00 -0.09 1.00 0.16 0.12
## sulphates 0.13 0.07 0.16 1.00 -0.02
## alcohol -0.45 -0.78 0.12 -0.02 1.00
## quality -0.17 -0.31 0.10 0.05 0.44
## quality
## fixed.acidity -0.11
## volatile.acidity -0.19
## citric.acid -0.01
## residual.sugar -0.10
## chlorides -0.21
## free.sulfur.dioxide 0.01
## total.sulfur.dioxide -0.17
## density -0.31
## pH 0.10
## sulphates 0.05
## alcohol 0.44
## quality 1.00
This is the correlation matrix of the wine data. However, it is difficult to interpret in this form. In the next section I try to visualize it.
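A matrix like the one above can be produced along the following lines. This is a minimal sketch on a small simulated data frame (wd_toy and its columns are assumptions for illustration); the real data has more columns, but the steps are the same: count missing values, drop the index X, then call cor().

```r
# Simulated stand-in for the wine data frame
set.seed(42)
n <- 200
alcohol_sim <- runif(n, 8, 14)
wd_toy <- data.frame(
  X       = seq_len(n),
  alcohol = alcohol_sim,
  density = 1.01 - 0.0015 * alcohol_sim + rnorm(n, sd = 0.001),
  pH      = rnorm(n, mean = 3.2, sd = 0.15)
)

# Count missing values before computing correlations
n_missing <- sum(is.na(wd_toy))
cat("missing values :", n_missing, "\n")

# Drop the row index X, then compute Pearson correlations rounded to 2 digits
features <- wd_toy[, setdiff(names(wd_toy), "X")]
cor_mat <- round(cor(features), 2)
print(cor_mat)
```

If the data did contain missing values, cor() accepts a use argument (for example use = "complete.obs") to control how they are handled.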
It is easier to see the correlations in the plot above. For example, alcohol has a strong negative correlation with density, at around -0.78. With a value of 0.84, density is also highly correlated with residual sugar. Alcohol is negatively correlated with density, total sulfur dioxide and residual sugar. pH is negatively correlated with fixed acidity, which makes sense: the higher the acidity, the lower the pH. Another correlation is between free sulfur dioxide and total sulfur dioxide, which is again obvious since free sulfur dioxide is part of total sulfur dioxide.
### Alcohol vs. density
Alcohol content and density have a strong negative correlation, with a value of -0.78. The Pearson correlation measures the linear relationship between two variables. As seen in the plot, a line was fitted to the data points to show the linear relationship between alcohol and density.
This plot also shows a linear pattern. However, the plot can be improved. Let’s show only the data up to the 95% quantile.
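Restricting a scatter plot to the 95% quantile can be done by filtering the data before plotting. The sketch below uses simulated values with a few extreme points; the data frame toy and its contents are assumptions for illustration.

```r
# Simulated data: a few extreme sugar values stretch the axes
set.seed(7)
sugar_sim   <- c(runif(95, 1, 15), 40, 50, 55, 60, 65.8)
density_sim <- 0.99 + 0.0004 * sugar_sim + rnorm(100, sd = 0.0005)
toy <- data.frame(residual.sugar = sugar_sim, density = density_sim)

# Upper limits at the 95% quantile of each variable
sugar_cap   <- quantile(toy$residual.sugar, 0.95)
density_cap <- quantile(toy$density, 0.95)

# Keep only the rows at or below both limits
trimmed <- toy[toy$residual.sugar <= sugar_cap & toy$density <= density_cap, ]
cat("rows kept:", nrow(trimmed), "of", nrow(toy), "\n")
```

Filtering removes the points from the data, so any fitted line is then based on the trimmed data only. Limiting just the visible axis range instead would keep the fit unchanged.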
Much better. Here again the linear correlation between residual sugar and density is visible. Density is highly correlated with both alcohol and sugar. This makes sense, as the density of a wine depends largely on how much alcohol and sugar it contains. Therefore, if the data were used for training a model later, density could be replaced by alcohol and sugar, or alcohol and sugar could be replaced by density.
This is one of the relationships that I was not expecting. Alcohol content is negatively correlated with the salt (chlorides) content of the wine, with a value of -0.36.
##
## Pearson's product-moment correlation
##
## data: wd$alcohol and wd$chlorides
## t = -27.016, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3843183 -0.3355673
## sample estimates:
## cor
## -0.3601887
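Output like the above comes from cor.test(), which reports the Pearson estimate together with a t statistic, degrees of freedom and a confidence interval. The following sketch runs it on simulated vectors; the values are assumptions and will not reproduce the numbers above.

```r
# Simulated stand-ins for the alcohol and chlorides columns
set.seed(3)
alcohol_sim   <- runif(500, 8, 14)
chlorides_sim <- 0.08 - 0.003 * alcohol_sim + rnorm(500, sd = 0.02)

ct <- cor.test(alcohol_sim, chlorides_sim, method = "pearson")

r_est   <- unname(ct$estimate)   # sample correlation
r_ci    <- ct$conf.int           # 95 percent confidence interval by default
r_df    <- unname(ct$parameter)  # degrees of freedom, n - 2
cat("cor =", round(r_est, 3), " df =", r_df, "\n")
```

The degrees of freedom equal the number of observations minus two, which is why the output above shows df = 4896 for 4898 wines.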
I first created a new data set that groups the wines by quality score and then calculated the mean and median alcohol content of each group. Overall, quality gets higher with alcohol content. However, the average alcohol content first drops and then rises again as quality increases.
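The grouping step described here can be sketched in base R with tapply(). The toy data frame and the column names of the summary are assumptions for illustration, not the original code.

```r
# Toy data: a handful of wines with a quality score and alcohol content
toy <- data.frame(
  quality = c(3, 3, 5, 5, 5, 6, 6, 7, 7, 9),
  alcohol = c(10.5, 10.1, 9.4, 9.8, 9.6, 10.4, 10.8, 11.2, 11.6, 12.4)
)

# One row per quality score with mean, median and group size.
# tapply() and table() both order their results by increasing quality.
alcohol_by_quality <- data.frame(
  quality        = sort(unique(toy$quality)),
  alcohol_mean   = as.numeric(tapply(toy$alcohol, toy$quality, mean)),
  alcohol_median = as.numeric(tapply(toy$alcohol, toy$quality, median)),
  n              = as.numeric(table(toy$quality))
)
print(alcohol_by_quality)
```

Keeping the group size n next to the averages makes it easier to spot groups that are too small to trust, such as the sparse quality groups mentioned earlier.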
I again grouped the wines by quality so that they would be better distributed. The bin (2,5] is shown in the plot as low quality, the bin (5,6] as the average wine group and the bin (6,9] as high quality wine. As can be seen in the plot above, high quality wines have a higher alcohol content. Since group 6 consists of only one point (the average of group 6), it appears in the plot as a line and not as a boxplot. I decided to put groups 3, 4 and 5 together since there were very few observations in groups 3 and 4.
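Bins written as (2,5], (5,6] and (6,9] match the interval notation that cut() produces with its default right-closed intervals. A minimal sketch, assuming the labels low, average and high for the three bins:

```r
# Quality scores that occur in the data
quality_scores <- 3:9

# Right-closed intervals: (2,5], (5,6], (6,9]
quality_group <- cut(quality_scores,
                     breaks = c(2, 5, 6, 9),
                     labels = c("low", "average", "high"))

print(data.frame(quality = quality_scores, group = quality_group))
```

Leaving out the labels argument gives the interval names themselves as factor levels, which is how labels such as (2,5] end up on a plot axis.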
This plot also shows that higher quality wines have lower chlorides.
# Bivariate Analysis
Quality has its highest correlation with alcohol content (0.44), and it has a negative correlation with density (-0.31). Since alcohol also has a negative correlation with density, it was expected that quality would be correlated with density as well.
The alcohol content had a negative correlation with chlorides, which was interesting to see, since I did not expect that the amount of salt in the wine could be related to the alcohol level. The other relationships were to be expected, for example that higher pH goes with lower fixed acidity.
### What was the strongest relationship you found?
The strongest relationship is between density and residual sugar, with a correlation of 0.84, followed by the strong negative correlation of about -0.78 between alcohol and density.
As in the sections before, the alcohol content is the feature that drives these two plots. Since it is hard to see a clear distinction between the 7 quality groups, the plot on the right divides the wines into three groups of low, average and high quality. Here it can be seen that all groups contain wines with a wide range of sugar content, meaning that sugar does not have a strong influence on quality. Once again, however, alcohol content shows a strong relationship with quality.
In these plots we again see that quality has a negative correlation with chlorides and a positive correlation with alcohol, meaning high quality wines tend to have higher alcohol content and lower chlorides.
These plots show that the grand mean is closest to group 6, the average wine group. This is because, as seen before, group 6 has the highest count and therefore has the most influence on the grand mean. These two plots also confirm that higher quality wines tend to have more alcohol and lower chlorides.
Since, after alcohol, the strongest correlations with quality were the negative ones with density and chlorides, I plotted these two features against each other. As can be seen again, the lower the density and the chlorides, the higher the quality.
The features total.sulfur.dioxide and volatile.acidity (acetic acid) do not show any obvious division between the quality groups. However, this was to be expected since they are not strongly correlated with quality.
Wine quality has a negative correlation with chlorides, density, acetic acid (volatile.acidity) and total.sulfur.dioxide, and a stronger positive correlation with alcohol.
I thought the relationship between chloride and alcohol content was interesting. They are negatively correlated.
The three plots that summarize the findings in the best way will be shown:
I like to plot the correlations in a heat map or, even better, in a plot like the one above. It makes it very easy to see the correlation between pairs of features. This plot was a guideline for my bivariate and multivariate analysis, since it shows the strongest correlations and whether they make sense. For example, just by looking at this plot I can see that pH has a negative correlation with fixed acidity. This makes sense, since the lower the acidity, the higher the pH. However, this plot only shows Pearson correlations, so it cannot reveal non-linear relationships.
This plot shows that the higher the alcohol content, the better the wine quality. I think this might be true because the longer the wine is fermented, the higher the alcohol content and the higher the quality. However, this is just a hypothesis.
### Plot Three
Plotting was hard since there were many data points, and even with different alpha values the plot was still not very clear. In the plot above, however, we can see that wines with higher alcohol content and lower chlorides tend to be in the higher quality groups (the blue points), and wines with lower alcohol and higher chlorides tend to be of lower quality (the red points). The average wines are roughly in the middle (the green points).
In this data set there were 11 features that could influence the quality of the wine. I have done univariate, bivariate and multivariate analysis on the data. I first read the text file that came with the data to see what the authors had in mind. This helped me later in understanding the relationships between some of the features. There were very few wines in groups 3, 4 and 9, which made it difficult to always see a clear trend. I tried a different grouping to get more similar group sizes: low (2,5], average (5,6] and high (6,9].
My findings:
1. Alcohol is the main feature influencing wine quality: the higher the alcohol content, the better the wine tends to be.
2. A higher amount of salt (chlorides) tends to put the wine in a lower quality group.
3. The sugar left in the wine after fermentation varies and does not appear to affect the quality.
4. Density reflects how much alcohol and how much sugar is left in the wine. It is strongly correlated with both sugar and alcohol.
5. If volatile acidity (acetic acid) gets high, the wine gets a vinegar-like taste. However, in the correlation plot there is only a very small correlation between volatile acidity and the quality of the wine.
6. Salt and alcohol have a negative correlation, meaning the higher the chlorides, the lower the alcohol content.
7. pH
The hardest part of this project was the univariate analysis and trying to make sense of the data with transformations. I tried log10 and sqrt transformations, but it does not come naturally to me which one to use when. Also, since this was my first time using R, it was a bit hard to be fast. However, plotting in R is very easy, and I really like how you can add layers on top of each other, which makes visualizing an easy task. Finally, since I do not drink wine, it was hard to imagine how a wine can have salt in it.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
https://rpubs.com/watseob/EDA
https://github.com/DariaAlekseeva/Red_and_White_Wine_Quality/blob/master/Wines.Rmd
https://github.com/dpipkin/udacity-wine/blob/master/P4.rmd
https://www.r-bloggers.com/r-using-rcolorbrewer-to-colour-your-figures-in-r
http://www.sthda.com/english/wiki/ggplot2-title-main-axis-and-legend-titles